# arXiv Bulk Source Data Process

Process for downloading, extracting, unarchiving, and organising the bulk arXiv source data for usage in machine learning applications or data mining. These instructions outline how to download the bulk source data using `s3cmd`, unarchiving, decompressing, and organising the downloaded files, and downloading OAI metadata using `metha`.

Initial download and statistics in January-February 2019. Instructions last checked June 2020.


# Table of Contents

1.  [arXiv Bulk Source Data Process](#orgb025d53)
    1.  [Background](#orgb9ccc5e)
    2.  [Download all source files](#orgafcf743)
    3.  [Unarchive and decompress](#orgd4f7783)
        1.  [Optional: Extract images from PDF documents](#org36b138b)
        2.  [File organisation and directory structure](#orgcab9a03)
        3.  [Directory structure example:](#org34738f9)
        4.  [Directory structure (tree command)](#org778d279)
        5.  [Image totals](#org4dbf678)
    4.  [Download OAI metadata](#orgbed8cc0)
        1.  [Extract Metha gz files](#orgb9caffa)
        2.  [Example metadata](#org36df57b)
    5.  [Organise dataset](#orgfba7bb2)


<a id="orgb9ccc5e"></a>

## Background

arXiv bulk source data is made readily available, however, it requires a number of steps to download, unarchive, and decompress these files. Additionally, the source data is separate from the article metadata, requiring additional steps to link image data to article metadata such as author, category, or publication date. Here we outline the steps taken to access the bulk source data and extract the images, as well as create a SQLite database (see sqlite-method.org) that enables indexing each image together with article metadata. This enables searching across both the article metadata and image data to build custom queries, collect statistics and access subsets of this dataset.

arXiv bulk data as either PDFs or source data can be downloaded via Amazon S3 buckets. The source data for each article generally comprises the TeX or LaTeX code used to generate the article, image or vector graphics files, any other associated files, and ancillary materials. A number of articles are also stored within the bulk data as single PDF files only. We downloaded the full allotment of source files as of December 31, 2018 – a total download of approximately 1 terabyte in the form of 2150 compressed tar archive files. These compressed files were decompressed, unarchived, renamed, and organised into a folder hierarchy using bash commands on an Ubuntu Linux 18.04 desktop computer. This process extracts all tar files into separate folders; moves all PDF files to appropriately named folders; optionally retrieves images and text from PDF files; extracts any remaining gzip compressed files; and then moves remaining TeX files to individual folders.


<a id="orgafcf743"></a>

## Download all source files

Download all source files from the arXiv Amazon Web Services (AWS) data storage using requestor pays.

-   Information provided by arXiv: <https://arxiv.org/help/bulk_data_s3>
-   At the time of downloading in January-February 2019, this equated to approximately 1 terabyte of data in the form of 2150 compressed tar archive files
-   Downloading from UNSW in Australia, this download via AWS cost ~AU$150
-   Download took several days
-   Used s3cmd (<https://s3tools.org/s3cmd> and <https://github.com/s3tools/s3cmd>) for automated downloads using the &#x2013;sync function (NB: turning off MD5 hash checking greatly speeds up process, but use with caution)

```bash
# first set up s3cmd with username and password

# testing - list files
s3cmd ls --requester-pays s3://arxiv/src/ 2>&1 | tee ~/s3cmd_ls_log.txt

# get manifests
s3cmd get --requester-pays s3://arxiv/src/arXiv_src_manifest.xml
# s3cmd get --requester-pays s3://arxiv/pdf/arXiv_pdf_manifest.xml

# NOTE: sync commands take a long time to begin due to checking md5s etc.
# may not respond or output any text for some time

# dry run
s3cmd sync --requester-pays --skip-existing --no-check-md5 -v -n s3://arxiv/src/ /path/to/download/folder/ 2>&1 | tee ~/s3cmd_sync_dryrun_log.txt

# sync with md5 check -- !!! this takes a long time !!!
# s3cmd sync --requester-pays --skip-existing -v s3://arxiv/src/ /path/to/download/folder/ 2>&1 | tee ~/s3cmd_log_checkmd5.txt
# sync without check and skipping existing
s3cmd sync --requester-pays --skip-existing --no-check-md5 -v s3://arxiv/src/ /path/to/download/folder/ 2>&1 | tee ~/s3cmd_log.txt
```

Totals:

-   2150 compressed tar archive files


<a id="orgd4f7783"></a>

## Unarchive and decompress

The following code is intended to be run as a number of single line bash commands that iteratively decompressed, unarchived, renamed, and organised the data, as well as extracting images and text from PDF documents. This allows for checking if the expected result has been achieved at each step. This process will take a long time.

NB: If you want to also extract images from PDF documents, uncomment or run those commands in the appropriate sequence of commands.

See `arxiv_extract.sh` - some code reproduced here:

```bash
# after downloading all arXiv tars and placing them in ~/arXiv/src
cd ~
mkdir arXiv
cd arXiv
mkdir src_all
mkdir src
# move source tars now

# it may also be helpful to place the data on a secondary drive and create a symlink

# test - get list of all archives
for i in src/*; do echo $i; done

# for each archive, decompress into a specific folder
for i in src/*; do tar xvf $i -C src_all/; done

# change directory - remaining commands are done from here
cd ~/arXiv/src_all

# move all pdf files to their own folder
find . -maxdepth 2 -name "*.pdf" -print -exec sh -c 'mkdir "${1%.*}" ; mv "$1" "${1%.*}" ' _ {} \;

# do PDF image extraction here as it will operate only on the papers that were given only as pdf
# extract all images from pdf files
# find . -maxdepth 3 -name "*.pdf" -print -exec sh -c 'pdfimages -png "${1}" "${1}_image" ' _ {} \;

# extract text from pdf files
find . -name "*.pdf" -print -exec sh -c 'pdftotext "${1}" "${1%.*}_get.txt" ' _ {} \;

# for each archive within each subfolder
# find all gz tars, extract, and then delete the gz files
for d in *; do cd "$d" && for f in *.gz; do tar xvfz "$f" --one-top-level && rm "$f"; done; cd ..; done

# note that some of the archives are gz only and not tar
# seems to be because they only contain one file
# so for these we use gunzip which neatly replaces each .gz with a text file
find . -name "*.gz" -exec gunzip -v -q {} \;

# and for each individual (tex) file, make a folder and move the item to that folder
# note this needs to do some trickery as many of these files don't have extensions and we can't make a folder of the same name
find . -maxdepth 2 -type f -print -exec sh -c 'mkdir "${1}_dir" ; mv "$1" "${1}.srconly"  ; mv "${1}.srconly" "${1}_dir" ; mv "${1}_dir" "$1"' _ {} \;
```

Totals:

-   1,476,538 total articles (by number of folders extracted)
-   114,132 PDF-only articles (no source provided)
-   324,101 source-only articles (single source file only, no images)


<a id="org36b138b"></a>

### Optional: Extract images from PDF documents

The extraction of images is commented out of `arXiv_src_scripts/arxiv_extract.sh`. This originally seemed like an important process, as there is a decent portion of the arXiv that was not submitted as source code, however this process produced a very "dirty" dataset with a number of problems in the image files: a large number are "stripes" (images split into multiple horizontal bars) as well as lots of single symbols, strange transparency or inverted colours, and low resolution images. Only 7.69% (~113k) of all articles are submitted to arXiv as PDF only, however this process extracted over 27 million image files from PDFs, however many of these are unusable.

We made the decision to ignore this part of the dataset and proceed with using only the images found in the source uploads. This will save time and effort in cleaning the data, as well as avoiding a number of pitfalls of having such a large and messy dataset, but at the cost of not having any images extracted from PDF files.

If you do use this extraction code, each image extracted from a PDF will be given the filename extension `.pdf_image-XXX.png`, so they can be ignored or conditionally operated upon at later stages of the process.

Totals

-   Total number of articles: 1,483,662
-   Number of these that were PDF only: 114,132 (7.69% of total number of articles)
-   27,198,781 images extracted from PDFs


<a id="orgcab9a03"></a>

### File organisation and directory structure

Each article in the source directory has its own folder named by its arXiv identifier, in the format `YYMM.XXXXX` (or for articles pre-2015, 4 trailing digits in the form of `YYMM.XXXX`). Articles prior to March 2007 use the identifier `archive.subjectclass/YYMMXXX` e.g. `math.GT/0309136`. Image files are named according to the original filenames that were deposited to arXiv, e.g. `Fig4.eps`, `office_heatmap.jpg`, `figure3d.pdf` etc. Details on identifier convention at <https://arxiv.org/help/arxiv_identifier>.


<a id="org34738f9"></a>

### Directory structure example:

```bash
- arXiv
  - src_all
    - date in format YYMM, e.g:
    - 1512
    - 1601
    - 1602
      - individual article folders, e.g.:
      - 1804.04821
      - 1804.04822
      - 1804.04823
      - 1804.04824
      - 1804.04825
	- subfolders for additional code or figures, e.g.:
	- figures
	- diagrams
	- text
```


<a id="org778d279"></a>

### Directory structure (tree command)

```bash
1801/
├── 1801.00001
│   ├── Einstein_Ring.tex
│   ├── Fig_1.jpg
│   ├── Fig_2.jpg
│   ├── Fig_3.jpg
│   ├── Fig_4.jpg
│   └── Fig_5.jpg
├── 1801.00002
│   ├── 1801.00002_get.txt
│   ├── 1801.00002.pdf
│   ├── 1801.00002.pdf_image-000.png
│   ├── 1801.00002.pdf_image-001.png
│   ├── 1801.00002.pdf_image-002.png
│   ├── 1801.00002.pdf_image-003.png
│   ├── 1801.00002.pdf_image-004.png
│   └── 1801.00002.pdf_image-005.png
├── 1801.00003
│   ├── 0_285-eps-converted-to.pdf
│   ├── 0_57-eps-converted-to.pdf
│   ├── 1_4-eps-converted-to.pdf
│   ├── bubble-eps-converted-to.pdf
│   ├── e_2-eps-converted-to.pdf
│   ├── He_a.jpg
│   ├── He_c.jpg
│   ├── He_d.jpg
│   ├── ...
│   └── u_1-eps-converted-to.pdf
	...

1802/
├── 1802.00001
│   └── 1802.00001.srconly
├── 1802.00002
│   ├── draft.tex
│   ├── IEEEtran.cls
│   ├── images_anomalydetection
│   │   ├── apattern.png
│   │   ├── cnn.png
│   │   ├── football_patterns.png
│   │   ├── onehot-game.png
│   │   ├── patterns.png
│   │   ├── ROC.png
│   │   ├── scenarios.png
│   │   └── workflow.png
│   ├── main.bbl
│   └── main.tex
	...
```


<a id="org4dbf678"></a>

### Image totals

Below is a breakdown of the most common image formats. There are more images than just these file extensions, but in uncommon formats, or in formats that are a bit tricky to work with (like metapost or xfig vector graphics languages), but the numbers of these are much smaller proportions of the dataset.

```org
|-----------+----------+-------|
| extension |    total |     % |
|-----------+----------+-------|
| eps       |  4223083 | 42.01 |
| pdf       |  3299043 | 32.82 |
| png       |  1076731 | 10.71 |
| ps        |   909314 | 9.045 |
| jpg       |   485452 |  4.83 |
| pstex     |    23922 |  0.24 |
| gif       |    19054 |  0.19 |
| svg       |    12400 |  0.12 |
| epsf      |     4060 |  0.04 |
|-----------+----------+-------|
| total     | 10053059 |   100 |
|-----------+----------+-------|
```

The arXiv bulk source download includes a total of over 10 million images. These image formats are all relatively straightforward to work with and seem to give a good spread across different uses such as vector graphics (eps/svg), web (jpeg/gif), and print (ps).

Mean average of 6.81 images per article


<a id="orgbed8cc0"></a>

## Download OAI metadata

ArXiv participates in the Open Archives Initiative and provides up-to-date metadata for the full archive. Full bulk metadata is therefore downloadable using the OAI-PMH v2.0 protocol (Lagoze et. al., 2001). These records are stored in XML format and contain details such as the identifier, author, title, categories, licence, abstract, and date. The format specification of the metadata used for this project is arXiv, although other formats are available that provide slightly different data fields. As we required the entire metadata database, we used a harvesting tool to request the metadata for the entire time range, which is downloaded incrementally in blocks of approximately 10,000 items. We downloaded the metadata for all articles (records) up to the end of 2018 using the Metha OAI harvesting tool. We used [metha](<https://github.com/miku/metha>) for the OAI harvesting. Metha is a command line harvester that incrementally caches a particular endpoint. This produced 1506176 (1.5m) unique records.

To harvest all OAI records in OAI format:

```bash
# first install metha using instructions from https://github.com/miku/metha

# check metha folder -- might not give output if not previously run
metha-sync -dir http://export.arxiv.org/oai2

# run harvester
metha-sync -format arXiv http://export.arxiv.org/oai2

# run harvester with date constraints
# metha-sync -format arXiv -from 2019-01-01 http://export.arxiv.org/oai2
```

Totals:

-   1,506,176 OAI records
-   1,578 xml files


<a id="orgb9caffa"></a>

### Extract Metha gz files

OAI records are in gzipped XML format (Extensible Markup Language) using the arXiv format. This will first need to be unzipped:

```bash
# !!! back up files first, this converts in-place !!!

# change to directory with metha-downloaded oai gz files
for f in *.gz; do gunzip "$f"; done
```


<a id="org36df57b"></a>

### Example metadata

Example XML file in the “arXiv” format showing organisation of metadata according to specific tags such as "categories", "title", "author", "abstract" etc. Example shown here is Callan, David: A determinant of Stirling cycle numbers counts unlabeled acyclic single-source automata, 2007, <https://arxiv.org/abs/0704.0004>.

```xml
<ListRecords>
    <record>
	<header status="">
	    <identifier>oai:arXiv.org:0704.0004</identifier>
	    <datestamp>2007-05-23</datestamp>
	    <setSpec>math</setSpec>
	</header>
	<metadata>
	    <arXiv xsi:schemaLocation="http://arxiv.org/OAI/arXiv/                             http://arxiv.org/OAI/arXiv.xsd">
		<id>0704.0004</id>
		<created>2007-03-30</created>
		<authors>
		    <author>
			<keyname>Callan</keyname>
			<forenames>David</forenames>
		    </author>
		</authors>
		<title>A determinant of Stirling cycle numbers counts unlabeled acyclic single-source automata</title>
		<categories>math.CO</categories>
		<comments>11 pages</comments>
		<msc-class>05A15</msc-class>
		<abstract>  We show that a determinant of Stirling cycle numbers counts unlabeled acyclic single-source automata. The proof involves a bijection from these automata to certain marked lattice paths and a sign-reversing involution to evaluate the determinant.
		</abstract>
	    </arXiv>
	</metadata>
    <about/>
</record>
```


<a id="orgfba7bb2"></a>

## Organise dataset

The above steps should yield the bulk source data organised into a sensible folder heirarchy. However, this format is not easily queried. Our solution was to place the data for both metadata and image files into an SQLite database as an attempt to link this data and be able to query and analyse the dataset. This also allows adding figure captions to allow for searching across these different sets of data. For more information, see the `sql-method` document.